Algorithmically-specialized computers are likely to be parallel
machines  since  parallelism  is  an  effective  method  of  circumventing  the 
physical  limits  of  switching  and  signal  transmission  delays.  If  so,  then  an 
important  design  decision  is  whether  the  algorithmically-specialized 
computer  executes  synchronously,  asynchronously  or  in  an  intermediate 
mode  such  as  data-driven  execution.  The  decision  is  crucial  because  it 
influences  cost,  performance  and  the  convenience  of  programming.  For 
asynchronous execution, there is overhead associated with processor-to-
processor communication because of the requisite hand-shaking protocol.
Data-driven execution must be charged for the additional circuitry
needed to buffer data arriving at a processor prior to its use and to
provide a signalling-back mechanism indicating when buffer space is
available. To their credit, both mechanisms appear to be easy to program,
although the programs are subject to possible deadlocks. Synchronous
execution  has  none  of  the  overhead  problems,  nor  is  it  subject  to 
deadlock. However, assuming (as is reasonable) that the single "steps" of
an abstract algorithm are implemented by varying numbers of more
primitive processor steps, idles will have to be inserted in some
processors so that they match the execution rate of the processors with
which they communicate. There are cases where this cannot be done.
Moreover, when it can be done, the resulting programs can be problem-size
dependent, hardware dependent, and extremely difficult to write. These
are important considerations that cannot be easily dismissed.

In order to evaluate the consequences of problems such as these, we
have developed a model for analyzing general interprocessor
communication. What makes the model unique and especially useful for the
problems mentioned above is that it is parameterized by execution mode.
This enables different execution modes to be expressed in one formalism
in which fair and accurate comparisons can be made.

The purpose of the paper is to present the model in its full generality
and to summarize our early experience with it.
A MODEL OF PARALLEL PROGRAMS

We assume that a parallel processor is composed of $m$ processing
elements $M_1, \ldots, M_m$ which collectively implement an algorithm.
The processing elements (PEs) have local memories for program and data
storage, and they execute sequential programs under the control of their
own program counters. We are concerned only with the input/output
behavior of these machines. To avoid hiding communication costs, we
assume that the PEs do not share any common memory; instead they
communicate through read and write operations. On each time step, a
PE can attempt a set of I/O operations simultaneously. Whether or not
an operation executes when it is attempted depends on the execution
mode. An operation that does not execute is retried on the next step, and
a process does not proceed with a new set of operations until all of its
current operations have completed.

We model such systems as Interprocess Communication (IC) Systems.
An IC system is completely defined by a function, $A$ (mnemonic for
"advance"), giving the execution mode of the system, and a set of
sequences $V_1, V_2, \ldots, V_m$, each describing the behavior of a
single PE. The i-th sequence describes the behavior of the i-th machine.
There are three types of operations, which are represented as follows:
reads: the read of value $a$ from PE $i$ is denoted $r_{i,a}$;
writes: the write of value $a$ to PE $i$ is denoted $w_{i,a}$;
and

time delays: a delay of $n$ time units is denoted $d_n$ (these delays are
used only in asynchronous mode as described below).

Each symbol in a behavioral sequence is a (possibly empty) set of these
operations subject to two restrictions: there is at most one time delay
operation in any set (if there is no time delay, the operation is assumed
to require one time step); and there is not more than one read from, and
not more than one write to, any PE in a single set. Figure 1(a) is an IC
system representing the systolic processor for band matrix-vector
multiplication with a bandwidth of four [6].† The sequences of operation
sets for each PE are specified by regular expressions. Since the system
is synchronous, there are no time delay operations, and since the system
does not have data-dependent branches, we represent the transmitted
values by a single, generic value $x$. Figure 1(b) shows the
communication graph for this system; each vertex represents a PE and a
directed edge from node $i$ to node $j$ represents a communication link
over which the i-th PE writes to the j-th PE and the j-th PE reads from
the i-th PE.
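
The construction of a communication graph from the behavioral sequences
can be sketched in a few lines of Python. The encoding is an illustrative
assumption and not part of the model: an operation is a tuple
("r", i, a) or ("w", i, a), a PE's behavior is a list of operation sets,
and PEs are numbered from 1. The function name communication_graph is
hypothetical.

```python
def communication_graph(V):
    """Return the set of directed edges (i, j) such that PE i writes to PE j
    or PE j reads from PE i somewhere in the behavioral sequences V.

    V[i-1] is the list of operation sets of PE i (PEs are numbered from 1).
    """
    edges = set()
    for i, sequence in enumerate(V, start=1):
        for ops in sequence:
            for op in ops:
                if op[0] == "w":      # PE i writes to PE op[1]
                    edges.add((i, op[1]))
                elif op[0] == "r":    # PE i reads from PE op[1]
                    edges.add((op[1], i))
    return edges


# Toy three-PE pipeline (hypothetical, not the system of Figure 1).
pipeline = [
    [{("w", 2, "x")}],
    [{("r", 1, "x"), ("w", 3, "x")}],
    [{("r", 2, "x")}],
]
print(sorted(communication_graph(pipeline)))
```

Delay operations carry no endpoint and therefore contribute no edges.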

† Note that in our figures we use rectangular boxes to enclose sets rather
than the usual brace notation.


[Figure 1. (a) IC system representing the systolic processor for band
matrix-vector multiplication, giving the operation set sequences of
Processors 1 through 4. (b) Communication graph for the IC system of
Figure 1(a).]

We define the execution of an IC system in terms of three sequences,
$C^1, C^2, C^3, \ldots$, $\Delta^1, \Delta^2, \Delta^3, \ldots$, and
$Q^1, Q^2, Q^3, \ldots$. For all $k > 0$, $C^k$ describes the set of
operations that are attempted on the k-th execution step, $\Delta^k$
describes the time needed for those operations to complete, and $Q^k$
describes the status of communications if they all do complete. Each
element of the first sequence is an m-vector giving program counter
values (indexes into operation set sequences) for all PEs. Each element
of the second sequence is an m-vector giving timer values (the number of
steps that must elapse before the completion of the current operation
set) for all PEs. Each element of the third sequence is an $m \times m$
matrix of strings, giving the status of communications in terms of
strings of messages and requests. The status of communications on the
link from PE $i$ to PE $j$ is given by $q_{ij}^k$. Values that have been
written but that have not yet been read are denoted by elements of an
alphabet $\Sigma$; values that have been requested but that have not yet
been written are denoted by their inverses.† $q_{ij}^k$ is a queue of
written values (head on the right end) followed by a queue of requested
values (head on the left end); corresponding writes and reads cancel at
the boundary between these queues.
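
The cancellation of writes against reads on a single link can be
illustrated with a short Python sketch. The representation is an
assumption made for illustration only: a link status string is a list of
(value, sign) pairs, with sign +1 for a written value and -1 for a
requested value (its inverse). The names cancel and update_link are
hypothetical.

```python
def cancel(tokens):
    """Remove every adjacent pair 'written a, requested a' from the string."""
    out = []
    for value, sign in tokens:
        if out and sign == -1 and out[-1] == (value, +1):
            out.pop()                 # a followed by its inverse is empty
        else:
            out.append((value, sign))
    return out


def update_link(q, written=None, requested=None):
    """Prepend a newly written value, append a new request, then cancel.

    New writes join the tail (left end) of the written queue; new requests
    join the tail (right end) of the request queue.
    """
    alpha = [(written, +1)] if written is not None else []
    beta = [(requested, -1)] if requested is not None else []
    return cancel(alpha + q + beta)


q = update_link([], written="x")          # one buffered value
q = update_link(q, written="y")           # y queues behind x
q = update_link(q, requested="x")         # the read consumes x
print(q)
```

After the three updates only the written value y remains on the link. A
request that arrives before its write is held in the same way and is
cancelled when the matching value is written.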

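To start the sequences we define, for all $i, j \in [m]$,‡ $c_i^1 = 1$,

$$\delta_i^1 = \begin{cases} n & \text{if } d_n \in V_i(c_i^1) \\ 1 & \text{otherwise} \end{cases}$$

and $q_{ij}^1 = \alpha \beta$ where

$$\alpha = \begin{cases} a & \text{if } w_{j,a} \in V_i(c_i^1) \\ \lambda & \text{otherwise} \end{cases}$$

and

$$\beta = \begin{cases} a^{-1} & \text{if } r_{i,a} \in V_j(c_j^1) \\ \lambda & \text{otherwise} \end{cases}$$

with $V(j)$ denoting the j-th set of operations in the sequence $V$ and
$\lambda$ denoting the empty string.

† We represent the inverse of a symbol $a$ by $a^{-1}$ and define
$a a^{-1}$ to be the empty string; $\Sigma^{-1} = \{a^{-1} \mid a \in \Sigma\}$.

‡ $[m]$ denotes the set $\{1, 2, 3, \ldots, m\}$.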

$C^1$ shows all PEs executing their first set of operations, $\Delta^1$
shows all of the timer values set to their initial values, and $Q^1$
shows that the initial reads and writes are pending. The remainder of the
sequence of $C$s is defined to reflect the fact that a PE moves to a new
set of operations only if all operations in its previous set have
completed: for $k > 0$,

$$c_i^{k+1} = \begin{cases} c_i^k + 1 & \text{if } A(i,k) \\ c_i^k & \text{otherwise} \end{cases}$$

where $A(i,k)$ is true if the i-th PE finishes the $c_i^k$-th operation
set in step $k$. The exact form of $A$ depends on the mode of execution
and is discussed below. For $k > 0$, $\Delta^{k+1}$ is defined so that
the timers are set by the execution of a $d$ operation (default $d = 1$)
and are decremented by 1 on each subsequent step until they reach 0:

$$\delta_i^{k+1} = \begin{cases} n & \text{if } d_n \in V_i(c_i^{k+1}) \wedge c_i^{k+1} \neq c_i^k \\ 1 & \text{if } d_n \notin V_i(c_i^{k+1}) \wedge c_i^{k+1} \neq c_i^k \\ \delta_i^k - 1 & \text{otherwise.} \end{cases}$$


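The remainder of the sequence of $Q$s is defined to reflect the execution
of read and write operations: for $k > 0$,

$$q_{ij}^{k+1} = \alpha \, q_{ij}^k \, \beta$$

where

$$\alpha = \begin{cases} a & \text{if } w_{j,a} \in V_i(c_i^{k+1}) \wedge c_i^{k+1} \neq c_i^k \\ \lambda & \text{otherwise} \end{cases}$$

and

$$\beta = \begin{cases} a^{-1} & \text{if } r_{i,a} \in V_j(c_j^{k+1}) \wedge c_j^{k+1} \neq c_j^k \\ \lambda & \text{otherwise.} \end{cases}$$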

We  observe  that  our  execution  rules  are  more  general  and  more  realistic 
than  those  used  in  many  models  because  we  do  not  insist  that  all  of  the 
operations  in  a  set  execute  simultaneously.  Depending  on  the  definition 
of  A,  it  is  possible,  for  example,  to  allow  independent  reading  and  writing 
on  different  ports. 
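
The execution rules can be made concrete with a small Python sketch of
one execution step, with the predicate passed in as a parameter. Several
choices here are assumptions made for illustration: PEs and program
counters are indexed from 0, operations are tuples ("r", i, a),
("w", j, a) or ("d", n), a PE that runs off the end of a finite sequence
simply halts, and buffered_advance is a hypothetical predicate (writes
never block, reads wait for a value) rather than one of the definitions
of $A$ given in the paper.

```python
def ops_at(V, i, pc):
    """Operation set of PE i at program counter pc (empty once halted)."""
    return V[i][pc] if pc < len(V[i]) else frozenset()

def delay_of(ops):
    """Timer value set by an operation set: n for a d_n, default 1."""
    return next((op[1] for op in ops if op[0] == "d"), 1)

def cancel(tokens):
    """Cancel each written value against an adjacent matching request."""
    out = []
    for value, sign in tokens:
        if out and sign == -1 and out[-1] == (value, +1):
            out.pop()
        else:
            out.append((value, sign))
    return out

def link_update(q, ops_i, ops_j, i, j, moved_i, moved_j):
    """New status of the link from PE i to PE j: alpha + q + beta."""
    alpha = [(op[2], +1) for op in ops_i
             if moved_i and op[0] == "w" and op[1] == j]
    beta = [(op[2], -1) for op in ops_j
            if moved_j and op[0] == "r" and op[1] == i]
    return cancel(alpha + q + beta)

def initial(V):
    """First program counters, timers and link status strings."""
    m = len(V)
    c = [0] * m
    delta = [delay_of(ops_at(V, i, 0)) for i in range(m)]
    Q = [[link_update([], ops_at(V, i, 0), ops_at(V, j, 0), i, j, True, True)
          for j in range(m)] for i in range(m)]
    return c, delta, Q

def step(V, c, delta, Q, advance):
    """Apply the update rules once; advance(i, c, delta, Q) plays A(i, k)."""
    m = len(V)
    moved = [advance(i, c, delta, Q) and c[i] < len(V[i]) for i in range(m)]
    new_c = [c[i] + 1 if moved[i] else c[i] for i in range(m)]
    new_delta = [delay_of(ops_at(V, i, new_c[i])) if moved[i]
                 else max(delta[i] - 1, 0) for i in range(m)]
    new_Q = [[link_update(Q[i][j], ops_at(V, i, new_c[i]),
                          ops_at(V, j, new_c[j]), i, j, moved[i], moved[j])
              for j in range(m)] for i in range(m)]
    return new_c, new_delta, new_Q

def buffered_advance(i, c, delta, Q):
    """Hypothetical mode: finish when the timer expires and no read waits."""
    reads_pending = any(sign == -1 for row in Q for (_, sign) in row[i])
    return delta[i] <= 1 and not reads_pending

# PE 0 writes x twice; PE 1 delays two steps, then reads x twice.
V = [[{("w", 1, "x")}, {("w", 1, "x")}],
     [{("d", 2)}, {("r", 0, "x")}, {("r", 0, "x")}]]
c, delta, Q = initial(V)
for _ in range(4):
    c, delta, Q = step(V, c, delta, Q, buffered_advance)
print(c, Q[0][1])
```

In this toy run both PEs reach the end of their sequences and the link
from PE 0 to PE 1 ends empty. Substituting a different predicate for
buffered_advance changes the execution mode without touching the update
rules.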

The execution of an IC system is parameterized by the predicate $A$.
All  of  the  common  forms  of  execution  modes  can  be  succinctly  expressed 
within  our  model: 


Parameterizing our model with the execution mode enables us to compare
modes and it distinguishes our model from previous formal models
of computation such as the model proposed by Lipton, Miller and Snyder
[7], Petri Nets [8], the vector addition system model [5] and the model
developed by Arjomandi, Fischer and Lynch [1].


PRELIMINARY  RESULTS 

Our initial work has been practically motivated: we would like to be
able to program algorithms for the CHiP machine [9]. In particular, we
are working with an architecture in which computational operations are
executed synchronously but I/O operations may be either asynchronous
or synchronous. In asynchronous mode, a read that occurs before the
corresponding write is delayed and a write that occurs before the
corresponding read interrupts the destination PE to buffer the
transmitted value. In synchronous mode, I/O interrupts are masked off and
corresponding reads and writes must occur simultaneously. A program
that can be run fully in synchronous mode is said to be coordinated.
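
For loop programs without time delays, the definition of coordination
suggests a naive check, sketched below in Python: run all loops in lock
step for one common period and require that every write is met by the
corresponding read on the same step. This is only an illustration of the
definition under the stated assumptions (tuple-encoded operations, PEs
numbered from 1, one operation set per step); it is not one of the
testing algorithms of [3], and the name is_coordinated is hypothetical.

```python
from functools import reduce
from math import gcd


def is_coordinated(loops):
    """Naive lock-step check for loop programs with no delay operations.

    loops[i-1] is the list of operation sets that PE i repeats forever.
    Returns True if, on every step, the writes from PE i to PE j are
    exactly the reads by PE j from PE i.
    """
    lengths = [len(loop) for loop in loops]
    # The joint behavior repeats after the least common multiple of the
    # loop lengths, so one period is enough to examine.
    period = reduce(lambda a, b: a * b // gcd(a, b), lengths, 1)
    for t in range(period):
        writes, reads = set(), set()
        for i, loop in enumerate(loops, start=1):
            for op in loop[t % len(loop)]:
                if op[0] == "w":
                    writes.add((i, op[1], op[2]))   # (source, dest, value)
                elif op[0] == "r":
                    reads.add((op[1], i, op[2]))    # (source, dest, value)
        if writes != reads:
            return False
    return True


matched = [[{("w", 2, "x")}, set()], [{("r", 1, "x")}, set()]]
skewed = [[{("w", 2, "x")}, set()], [set(), {("r", 1, "x")}]]
print(is_coordinated(matched), is_coordinated(skewed))
```

The first toy program is coordinated because the write and the read fall
on the same step; the second is not, since the read trails the write by
one step.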

We would like, whenever possible, to run coordinated programs.
Unfortunately it is extremely difficult for programmers to produce such
code and the code itself is often problem-size and hardware dependent.
To surmount these problems, we have developed and implemented
algorithms for the automatic synthesis of coordinated programs from
data-driven programs [2]. These algorithms enable the programmer to
work in the more natural data-driven environment without forfeiting any
of the advantages of a synchronous device. They apply only to loop
programs in which each PE executes a single loop of instructions. (This
restriction at first may seem quite prohibitive but, in fact, most recent
algorithms for algorithmically-specialized computers are loop programs.
In addition, many programs can be viewed as loop programs by collapsing
parallel branches that have the same input/output behavior.)

We have developed two synthesis algorithms. The first, the Wave
Algorithm, works on all data-driven loop programs for which conversion is
possible but in some cases it produces inefficient code. The second
algorithm, the Buffered Write Algorithm, works for only a subset of loop
programs and, although it often increases the length of PE code
significantly, the results are more efficient. We are currently working
to expand the class of programs that we can convert and to develop
measures for accurately evaluating the efficiency and trade-offs of our
algorithms.

For the programs that we cannot coordinate or that, for reasons of
efficiency, require manual design, we have developed and implemented
algorithms for testing coordination [3]. We have efficient algorithms for
testing the coordination of loop programs and the worst-case
coordination of arbitrary programs. The general testing question is
PSPACE-hard [4]. We expect that as libraries of parallel programs become
available, our algorithms will be useful in determining their interface
compatibilities.